Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update REPL.fuzzyscore to use string distance #50412

Merged
merged 1 commit into from
Aug 14, 2023

Conversation

jakobnissen
Copy link
Contributor

The old heuristics were not particularly helpful. These new heuristics should be easier to reason about, since the score is between 0 and 1, and also yield much more intuitive results.

Closes #49466
See #49562

@jakobnissen
Copy link
Contributor Author

@tecosaur I've used your implementation as suggested in the issue. What's your name so I can attribute?

stdlib/REPL/test/docview.jl Outdated Show resolved Hide resolved
@jakobnissen jakobnissen added the stdlib:REPL Julia's REPL (Read Eval Print Loop) label Jul 5, 2023
@jakobnissen
Copy link
Contributor Author

CI failure is a spurious failure in 32-bit Windows only (as is tradition)

@vtjnash
Copy link
Sponsor Member

vtjnash commented Jul 6, 2023

IIUC, this is copied from https://github.com/tecosaur/DataToolkitBase.jl/blob/a0aa954c0f1aa3b7578c92021e7017987459fb6f/src/model/utils.jl#L43-L44, which was added by https://github.com/tecosaur/DataToolkitBase.jl/commit/8be9e3ce6333f4b0877f5ed62bff6de87a5a4c07.patch
(except the changed variable names to remove unicode?). I would suggest adding this credit to the squash commit, from that patch:

Co-authored-by: TEC <git@tecosaur.net>

@jakobnissen jakobnissen added the status:merge me PR is reviewed. Merge when all tests are passing label Jul 7, 2023
@jakobnissen
Copy link
Contributor Author

Squashed with attribution as suggested, so that the person clicking "squash and merge" don't have to remember adding attribution to the commit. The commit content is identical to that already approved.

@tecosaur
Copy link
Contributor

tecosaur commented Jul 7, 2023

I'm happy with this attribution 🙂 (glad my work was useful Jacob!), but for the record I also wouldn't be fussed if you left it out. My implementation is a variation on the StringDistances.jl implementation, which is itself a reimplementation of a C# implementation, which is itself a variation on some other implementation, etc. — it's all rather muddy.

I've also been thinking about whether I can introduce a half-penalty capitalisation transition, which could also be good here if I can work it out, but we can deal with that if/when it actually happens.

@vtjnash
Copy link
Sponsor Member

vtjnash commented Jul 7, 2023

This code then may be claiming copyrights under https://www.codeproject.com/info/cpol10.aspx, which the FSF does not consider compatible with the MIT license that people have subsequently attempted to apply to it and cannot be linked against GPL code. (https://www.gnu.org/licenses/license-list.html#cpol)

https://www.codeproject.com/Articles/13525/Fast-memory-efficient-Levenshtein-algorithm-2

@vtjnash vtjnash added status:DO NOT MERGE Do not merge this PR! and removed status:merge me PR is reviewed. Merge when all tests are passing status:DO NOT MERGE Do not merge this PR! labels Jul 7, 2023
@tecosaur
Copy link
Contributor

tecosaur commented Jul 7, 2023

Where did https://www.codeproject.com/Articles/13525/Fast-memory-efficient-Levenshtein-algorithm-2 come from? The C# implementation I was talking about is http://blog.softwx.net/2015/01/optimizing-damerau-levenshtein_15.html, which is an implementation of one of the algorithms listed on the Wikipedia page (https://en.wikipedia.org/wiki/Damerau–Levenshtein_distance#Distance_with_adjacent_transpositions), which seems to come from an ACM summary article (https://dl.acm.org/doi/10.1145/1963190.1963191).

This is what I'm referring to by "it's all rather muddy". My suspicion is that this code actually isn't copyrightable, since at the end of the day it's a rather straightforward implementation of a very well-explored algorithm.

@vtjnash
Copy link
Sponsor Member

vtjnash commented Jul 7, 2023

Looking closer at the original codeproject, I found the original code there, and I think the code in the blogpost is sufficiently different from it as to be merely "inspired by" and not "based on" that code. They appear to effectively implement a different pseudo-code, since the blog optimized and reorganized it in a way that made the result quite different from the original.

We should add a second line to the commit message to satisfy the licensing terms of the StringDistances.jl package then also:

Co-authored-by: matthieugomez <gomez.matthieu@gmail.com>

And it should be good to merge now.

We could add one more line for the blog post (which is also licensed as MIT), but the author there appears to have assigned the copyright in that license over to "anonymous".

@vtjnash
Copy link
Sponsor Member

vtjnash commented Jul 7, 2023

(note that these co-authored-by lines are not strictly required for licensing, since both authors have already been added as copyright holders to the project as a whole via their past commits: https://github.com/JuliaLang/julia/blob/master/LICENSE.md)

@tecosaur
Copy link
Contributor

tecosaur commented Jul 7, 2023

I'm not sure about that being the "original code", since that "original code" is from 2012, but the summary paper with the algorithm is from 2011, and whatever the source is in that summary paper, presumably is pre-2011 and likely came with an implementation.

@tecosaur
Copy link
Contributor

tecosaur commented Jul 7, 2023

Oh also, I referenced the StringDistances.jl implementation, that C# implementation + description in that blog post, and the Wikipedia article when writing my version. So... more mud? 😅. I'm still on "I don't think this is copyrightable".

@vtjnash
Copy link
Sponsor Member

vtjnash commented Jul 7, 2023

Sorry, I meant original code with respect to the codeproject implementation. In comparison to the blog post implementation then, they don't look to me like they share ancestry, other than a comment in the blog that "this is based on Sten Hjelmqvist's" algorithm at that codeproject link. It seems to me now that the comment was intended to credit him merely for researching into the algorithm, not the actual implementation.

My suspicion is that this code actually isn't copyrightable, since at the end of the day it's a rather straightforward implementation of a very well-explored algorithm

In the US, that question hasn't really been settled yet, c.f. https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,_Inc. (other jurisdictions may have different interpretations here)

@vtjnash
Copy link
Sponsor Member

vtjnash commented Jul 7, 2023

Oh also, I referenced the StringDistances.jl implementation, that C# implementation + description in that blog post, and the Wikipedia article when writing my version. So... more mud? 😅. I'm still on "I don't think this is copyrightable".

That is all fine. You cannot copyright an algorithm, and you can even copy directly from multiple sources as long as the licenses are all compatible (all are MIT in this case).

Aside: I find it amusing that the licensing question is happening on a PR about string distances, since that is quite related to what licensing itself is about measuring.

@jakobnissen
Copy link
Contributor Author

What needs to be done to push this through? From reading the comments it looks like all is in order.

The old heuristics were not particularly helpful. These new heuristics should
be easier to reason about, since the score is between 0 and 1, and also yield
much more intuitive results.

Co-authored-by: TEC <git@tecosaur.net>
Co-authored-by: matthieugomez <gomez.matthieu@gmail.com>
@jakobnissen
Copy link
Contributor Author

Bump. This is now rebased, and I added Matthieu Gomez as a co-author in the commit. Should be good to merge if CI passes, I think

@jakobnissen jakobnissen added the status:merge me PR is reviewed. Merge when all tests are passing label Aug 11, 2023
@KristofferC KristofferC merged commit a9116ef into JuliaLang:master Aug 14, 2023
5 of 7 checks passed
@jakobnissen jakobnissen deleted the fuzzy branch August 14, 2023 12:18
@jakobnissen jakobnissen removed the status:merge me PR is reviewed. Merge when all tests are passing label Aug 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stdlib:REPL Julia's REPL (Read Eval Print Loop)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

REPL.fuzzyscore heuristics could be better
6 participants